Mini-Tutorial 1: Data Visualization Concepts
Introduction
“All work presented is my own. I have not communicated with or worked with anyone else on this exam.”
In this mini-tutorial I will introduce the concepts behind the Grammar of Graphics and cover data visualization ideas from Data Visualization: A Practical Introduction. This section lays out the foundations of data visualization and explains why they matter when working with data.
Grammar of Graphics
The Grammar of Graphics includes these main concepts: Data, Geom, Mapping, and Faceting. The Grammar of Graphics is important because these concepts are the tools that data visualizers use to create plots and other graphics. They let you build and customize a plot to meet the needs of a project and help your audience understand and analyze the data.
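As a rough sketch of how the four concepts fit together in a single ggplot2 call, here is a small example. It assumes the `mpg` data set that ships with ggplot2, which is only a stand-in and is not one of the data sets used in this tutorial.

```r
library(ggplot2)

# `mpg` is an example data set bundled with ggplot2 (an assumption for this
# sketch; it is not part of the tutorial's data).
p_grammar <- ggplot(data = mpg,                          # Data
                    mapping = aes(x = displ, y = hwy)) + # Mapping
  geom_point() +                                         # Geom
  facet_wrap(~ drv)                                      # Faceting

p_grammar
```

Each piece can be swapped independently: a different data frame, a different geom, or a different faceting variable, without rewriting the rest of the call.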
Data
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.5 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
standings_df <- read_csv("data/standings.csv")
## Rows: 638 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): team, team_name, playoffs, sb_winner
## dbl (11): year, wins, loss, points_for, points_against, points_differential,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
standings_df
## # A tibble: 638 × 15
## team team_name year wins loss points_for points_against points_differen…
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Miami Dolphins 2000 11 5 323 226 97
## 2 India… Colts 2000 10 6 429 326 103
## 3 New Y… Jets 2000 9 7 321 321 0
## 4 Buffa… Bills 2000 8 8 315 350 -35
## 5 New E… Patriots 2000 5 11 276 338 -62
## 6 Tenne… Titans 2000 13 3 346 191 155
## 7 Balti… Ravens 2000 12 4 333 165 168
## 8 Pitts… Steelers 2000 9 7 321 255 66
## 9 Jacks… Jaguars 2000 7 9 367 327 40
## 10 Cinci… Bengals 2000 4 12 185 359 -174
## # … with 628 more rows, and 7 more variables: margin_of_victory <dbl>,
## # strength_of_schedule <dbl>, simple_rating <dbl>, offensive_ranking <dbl>,
## # defensive_ranking <dbl>, playoffs <chr>, sb_winner <chr>
This data set contains NFL standings from 2000 to 2019. Data is a major part of data visualization. Every plot requires it, but it is also important to understand the data you are using before you make any graph.
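One way to get to know a data set before plotting is to look at its column types, the range of a key variable, and how many rows each season has. The sketch below is only an illustration: `standings_sample` is a hypothetical name, and its five rows are copied from the printed standings output so that the code runs without data/standings.csv.

```r
library(tibble)
library(dplyr)

# Five rows copied from the printed standings_df output; `standings_sample`
# is a hypothetical stand-in for the full data set.
standings_sample <- tribble(
  ~team_name, ~year, ~wins, ~loss, ~points_for, ~points_against,
  "Dolphins",  2000,    11,     5,         323,             226,
  "Colts",     2000,    10,     6,         429,             326,
  "Jets",      2000,     9,     7,         321,             321,
  "Bills",     2000,     8,     8,         315,             350,
  "Patriots",  2000,     5,    11,         276,             338
)

glimpse(standings_sample)                      # column names and types
wins_range <- range(standings_sample$wins)     # smallest and largest win totals
rows_per_year <- count(standings_sample, year) # number of rows per season

wins_range
rows_per_year
```

The same three calls work on the full `standings_df` once it has been read in.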
Geom
The geom is the geometric object that you are using to represent the data. There are many different types of graphs that you could use to represent this data, but it is important that your graph represents the data well. That is why it is important to understand your data before you make these graphs.
NE_df <- standings_df %>% filter(team == "New England")
ggplot(data = NE_df, aes(x = year, y = wins)) +
  geom_col()
This bar graph shows the number of New England Patriots wins in each year.
ggplot(data = NE_df, aes(x = year, y = points_for)) +
  geom_point()
This scatter plot shows the number of points the New England Patriots scored each season from 2000 to 2019.
Mapping
Mapping connects variables in your data to the aesthetics of a plot through aes() in ggplot2. Aesthetics include, but are not limited to, position (x and y), color, fill, size, and shape. We have already seen some mapping in the examples above, but there is much more that you can do, as in the following example.
ggplot(data = standings_df, aes(x = wins, colour = team, fill = team)) +
  geom_bar()
As you can see, through this mapping I have been able to add color to represent each NFL team. However, the resulting graph is difficult to read, and an audience would struggle to understand what they are looking at. This allows me to introduce another important topic: faceting.
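A related point that is easy to miss is the difference between mapping an aesthetic to a variable and setting it to a constant. The sketch below is an illustration under assumptions: `afc_sample` is a hypothetical name, and its rows are copied from the tutorial's printed standings output so the code is self-contained.

```r
library(ggplot2)
library(tibble)

# Rows copied from the printed standings output (2000 and 2001 seasons);
# `afc_sample` is a hypothetical stand-in for the full data set.
afc_sample <- tribble(
  ~team_name, ~year, ~wins,
  "Dolphins",  2000,    11,
  "Jets",      2000,     9,
  "Bills",     2000,     8,
  "Patriots",  2000,     5,
  "Patriots",  2001,    11,
  "Dolphins",  2001,    11,
  "Jets",      2001,    10,
  "Bills",     2001,     3
)

# Mapping: colour varies with a variable, so it goes inside aes()
p_mapped <- ggplot(afc_sample, aes(x = year, y = wins, colour = team_name)) +
  geom_point(size = 3)

# Setting: a constant colour goes outside aes(), as an argument to the geom
p_set <- ggplot(afc_sample, aes(x = year, y = wins)) +
  geom_point(colour = "steelblue", size = 3)

p_mapped
p_set
```

Only the mapped version produces a legend, because only there does colour carry information from the data.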
Faceting
For this section I have cut the data set down to just the AFC East teams.
AFCEast_df <- standings_df %>%
  filter(team_name %in% c("Patriots", "Bills", "Jets", "Dolphins"))
AFCEast_df
## # A tibble: 80 × 15
## team team_name year wins loss points_for points_against points_differen…
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Miami Dolphins 2000 11 5 323 226 97
## 2 New Y… Jets 2000 9 7 321 321 0
## 3 Buffa… Bills 2000 8 8 315 350 -35
## 4 New E… Patriots 2000 5 11 276 338 -62
## 5 New E… Patriots 2001 11 5 371 272 99
## 6 Miami Dolphins 2001 11 5 344 290 54
## 7 New Y… Jets 2001 10 6 308 295 13
## 8 Buffa… Bills 2001 3 13 265 420 -155
## 9 New Y… Jets 2002 9 7 359 336 23
## 10 New E… Patriots 2002 9 7 381 346 35
## # … with 70 more rows, and 7 more variables: margin_of_victory <dbl>,
## # strength_of_schedule <dbl>, simple_rating <dbl>, offensive_ranking <dbl>,
## # defensive_ranking <dbl>, playoffs <chr>, sb_winner <chr>
ggplot(data = AFCEast_df, aes(x = year, y = wins, colour = team, fill = team)) +
  geom_col() +
  facet_wrap(~ team_name)
Through faceting we can make a much more organized graph that gives the people looking at our analysis a better idea of what they are seeing. The graph in the prior section was very disorganized; the facet_wrap() function, however, lets us see the win totals for these AFC East teams much more easily. It is also much easier for a viewer of this faceted plot to compare the number of wins between these four teams.
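One possible refinement, shown here only as a sketch, is to place the panels in a single row so they share one y-axis, and to drop the legend because the panel titles already name each team. The `afc_sample` tibble is a hypothetical stand-in built from rows of the printed AFC East output.

```r
library(ggplot2)
library(tibble)

# Rows copied from the printed AFC East output (2000 and 2001 seasons);
# `afc_sample` is a hypothetical stand-in for AFCEast_df.
afc_sample <- tribble(
  ~team_name, ~year, ~wins,
  "Dolphins",  2000,    11,
  "Jets",      2000,     9,
  "Bills",     2000,     8,
  "Patriots",  2000,     5,
  "Patriots",  2001,    11,
  "Dolphins",  2001,    11,
  "Jets",      2001,    10,
  "Bills",     2001,     3
)

p_facets <- ggplot(afc_sample, aes(x = year, y = wins, fill = team_name)) +
  geom_col() +
  facet_wrap(~ team_name, nrow = 1) +          # one row of panels, shared y-axis
  scale_x_continuous(breaks = c(2000, 2001)) + # label only the seasons present
  theme(legend.position = "none")              # panel titles already name the teams

p_facets
```

Whether a single row or a grid reads better depends on how many panels there are; with four teams either layout is reasonable.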
Problems with Honesty and Good Judgement
As a data visualizer you see graphs and representations of data every day. Through this repetition you learn how to read graphs and can spot when things are off. Unfortunately, not everyone has the skills to accurately read or understand what a graph is showing. In this section we will use a Happy Planet Index data set to show some good practices that help people interpret graphs and data.
hpi_df <- read_csv("data/hpi-tidy.csv")
## Rows: 151 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Country, GovernanceRank, Region
## dbl (8): HPIRank, LifeExpectancy, Wellbeing, HappyLifeYears, Footprint, Happ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
hpi_LE_df <- hpi_df %>%
  group_by(Region) %>%
  summarise(mean_LE = mean(LifeExpectancy)) %>%
  arrange(desc(mean_LE)) %>%
  # Regrouping here means fct_reorder() only sees one row at a time,
  # so the bars in this first plot are not sorted by mean_LE
  group_by(Region) %>%
  mutate(LEorder = fct_reorder(Region, mean_LE))
ggplot(data = hpi_LE_df, aes(x = LEorder, y = mean_LE, fill = LEorder)) +
  geom_col() +
  coord_flip() +
  # fill is the mapped aesthetic and LEorder is discrete, so use the discrete fill scale
  scale_fill_viridis_d() +
  labs(x = "Region",
       y = "Mean Life Expectancy") +
  theme(legend.position = "none")
We can see in this graph the mean life expectancy for each region in the data set. Although this graph is not necessarily misleading, the bars are in no particular order, so viewers with less experience reading graphs could find it tricky to interpret correctly.
hpi_df <- read_csv("data/hpi-tidy.csv")
## Rows: 151 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Country, GovernanceRank, Region
## dbl (8): HPIRank, LifeExpectancy, Wellbeing, HappyLifeYears, Footprint, Happ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
hpi_LE_df <- hpi_df %>%
  group_by(Region) %>%
  summarise(mean_LE = mean(LifeExpectancy)) %>%
  arrange(desc(mean_LE)) %>%
  # The data is ungrouped here, so fct_reorder() sees every region at once
  mutate(LEorder = fct_reorder(Region, mean_LE))
ggplot(data = hpi_LE_df, aes(x = LEorder, y = mean_LE, fill = LEorder)) +
  geom_col() +
  coord_flip() +
  # fill is the mapped aesthetic and LEorder is discrete, so use the discrete fill scale
  scale_fill_viridis_d() +
  labs(x = "Region",
       y = "Mean Life Expectancy") +
  theme(legend.position = "none")
The difference in this graph is that I have put the regions in descending order based on life expectancy. This lets people clearly see how the regions rank by mean life expectancy. Additionally, a major factor in helping the audience understand these bar charts is a zero baseline. Notice that all of these bars start at zero; the viewer can compare bar lengths directly without having to interpret the starting point of each bar.
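The only change between the two versions of the pipeline is whether the data is still grouped when fct_reorder() runs. The sketch below isolates that step. The region names and means in `toy_df` are hypothetical values invented for illustration; they are not taken from hpi-tidy.csv.

```r
library(dplyr)
library(forcats)
library(tibble)

# Hypothetical region means, invented for illustration only
toy_df <- tibble(
  Region  = c("A", "B", "C"),
  mean_LE = c(70, 60, 80)
)

# Grouped: each group holds a single row, so there is nothing to reorder
# within a group and mean_LE has no effect on the level order
grouped_df <- toy_df %>%
  group_by(Region) %>%
  mutate(LEorder = fct_reorder(Region, mean_LE)) %>%
  ungroup()

# Ungrouped: fct_reorder() sees all rows and sorts the levels by mean_LE
ordered_df <- toy_df %>%
  mutate(LEorder = fct_reorder(Region, mean_LE))

levels(grouped_df$LEorder)
levels(ordered_df$LEorder)  # "B" "A" "C", lowest mean first
```

Because ggplot2 draws a discrete axis in level order, the level order of `LEorder` is what decides the order of the bars.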
Good Data
Having good data that is representative of the topic you are trying to cover is extremely important when showing your charts and graphs to other people. Sometimes missing data makes only very small changes to your results, but at other times it can lead to massive changes in your analysis. In this section we will use the Happy Planet Index once again to show some of the effects of not having good, representative data.
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_full <- ggplot(data = hpi_df, aes(x = GDPcapita, y = Wellbeing, label = Country)) +
  geom_point() +
  geom_smooth()
ggplotly(plot_full, tooltip = "label")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
hpi_df
## # A tibble: 151 × 11
## HPIRank Country LifeExpectancy Wellbeing HappyLifeYears Footprint
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 109 Afghanistan 48.7 4.76 29.0 0.540
## 2 18 Albania 76.9 5.27 48.8 1.81
## 3 26 Algeria 73.1 5.24 46.2 1.65
## 4 127 Angola 51.1 4.21 28.2 0.891
## 5 17 Argentina 75.9 6.44 55.0 2.71
## 6 53 Armenia 74.2 4.37 41.9 1.73
## 7 76 Australia 81.9 7.41 65.5 6.68
## 8 48 Austria 80.9 7.35 64.3 5.29
## 9 80 Azerbaijan 70.7 4.22 39.1 1.97
## 10 146 Bahrain 75.1 4.55 43.5 6.65
## # … with 141 more rows, and 5 more variables: HappyPlanetIndex <dbl>,
## # Population <dbl>, GDPcapita <dbl>, GovernanceRank <chr>, Region <chr>
In this graph with all the countries we can see the overall relationship between GDP per capita and well-being. However, if we take out some of the data so that it is no longer representative, we see a much different story.
hpi_not_full_df <- hpi_df %>% slice(1:76)
hpi_not_full_df
## # A tibble: 76 × 11
## HPIRank Country LifeExpectancy Wellbeing HappyLifeYears Footprint
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 109 Afghanistan 48.7 4.76 29.0 0.540
## 2 18 Albania 76.9 5.27 48.8 1.81
## 3 26 Algeria 73.1 5.24 46.2 1.65
## 4 127 Angola 51.1 4.21 28.2 0.891
## 5 17 Argentina 75.9 6.44 55.0 2.71
## 6 53 Armenia 74.2 4.37 41.9 1.73
## 7 76 Australia 81.9 7.41 65.5 6.68
## 8 48 Austria 80.9 7.35 64.3 5.29
## 9 80 Azerbaijan 70.7 4.22 39.1 1.97
## 10 146 Bahrain 75.1 4.55 43.5 6.65
## # … with 66 more rows, and 5 more variables: HappyPlanetIndex <dbl>,
## # Population <dbl>, GDPcapita <dbl>, GovernanceRank <chr>, Region <chr>
plot_not_full <- ggplot(data = hpi_not_full_df, aes(x = GDPcapita, y = Wellbeing, label = Country)) +
  geom_point() +
  geom_smooth()
ggplotly(plot_not_full, tooltip = "label")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
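A quick way to check whether a subset still covers the data is to compare the range of a key variable in the full data and in the subset. The sketch below is only an illustration: the values in `toy_df` are hypothetical and not taken from hpi-tidy.csv, and `compare_ranges` is a name I made up for the result.

```r
library(dplyr)
library(tibble)

# Hypothetical values, invented for illustration only
toy_df <- tibble(
  id        = 1:10,
  GDPcapita = c(1, 2, 3, 4, 5, 10, 20, 40, 80, 160) * 1000
)

# Taking the first rows by position, as slice(1:76) does in the tutorial
first_half <- toy_df %>% slice(1:5)

# Side-by-side range of GDPcapita in the full data and in the subset
compare_ranges <- bind_rows(
  full       = summarise(toy_df, min = min(GDPcapita), max = max(GDPcapita)),
  first_half = summarise(first_half, min = min(GDPcapita), max = max(GDPcapita)),
  .id = "data"
)

compare_ranges
```

If the subset's range is much narrower than the full range, a smoother fitted to the subset cannot show what happens outside it, which is one reason two plots of the "same" data can tell different stories.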